Abstract

Hilberg (1990) supposed that the finite-order excess entropy of a random human text is
proportional to the square root of the text length. Assuming that Hilberg's hypothesis is
true, we derive Guiraud's law, which states that the number of word types in a text is greater
than proportional to the square root of the text length. Our derivation is based on a
mathematical conjecture in coding theory and on several experiments suggesting that words
can be defined approximately as the nonterminals of the shortest context-free grammar for
the text. Such an operational definition of words can be applied even to texts deprived of
spaces, which do not allow for Mandelbrot's "intermittent silence" explanation of Zipf's
and Guiraud's laws. In contrast to Mandelbrot's, our model assumes some probabilistic
long-memory effects in human narration and might be capable of explaining Menzerath's
law.
1 Introduction



Over a decade ago, Hilberg (1990) reinterpreted Shannon's (1950) well-known experimental
data and formulated a novel hypothesis concerning the entropy of human language. The
hypothesis states that block entropy $H(n)$ of a text drawn from natural language production,
except for disputable constant and linear terms, is proportional to the square root of the text
length $n$ measured in phonemes (or letters),



$$H(n) \approx h_0 + h_\mu n^{\mu} + h n, \qquad (1)$$



where $\mu \approx 1/2$. For brevity, we call relation (1) Hilberg's law. Hilberg's publication appeared in
a technical journal of telecommunications. It was popularized among natural scientists by
Ebeling (Ebeling and Nicolis, 1991; Ebeling and Pöschel, 1994) and stimulated some discussions
(Bialek et al., 2001; Crutchfield and Feldman, 2003; Shalizi).

In this article, we shall discuss some interaction between Hilberg's law and the better known
Guiraud's and Zipf's laws. The empirical Guiraud's law (Guiraud, 1954) states that the number of
orthographic word types $V$ in a text behaves like



$$V \propto N^{\rho}, \qquad (2)$$
where $\rho < 1$ is constant and $N$ is the length of the text measured in orthographic word tokens.
On the other hand, Zipf's-Mandelbrot's law (Zipf, 1935, 1949; Mandelbrot, 1954) states that
any text obeys relation



$$c(w) \propto \frac{1}{[r(w)]^{B}}, \qquad (3)$$

where B > 1 is constant, frequency c(w) is the count of word w in the text, and rank r(w) is the 
position of word w in the list of words sorted in descending order by c(w). 

We do not know to what extent Hilberg's law is valid. Formula (1) presupposes some
stationary probabilistic model of the entire natural language production, which is a highly
hypothetical entity itself. Nevertheless, we would like to argue that some form of Guiraud's law can
be deduced from equation (1). Strictly speaking, assuming that Hilberg's law is true for all $n$,
we shall only infer some lower bound for the growth of the vocabulary size. Despite that
restriction, we think that our explanation of Guiraud's law can be more linguistically plausible than
the famous joint derivation of Guiraud's and Zipf's laws provided by Mandelbrot (1953). The
latter derivation is also known as the "intermittent silence" explanation (Miller, 1957; Li, 1998).

Hilberg's law concerns the probabilistic distribution of arbitrary phoneme or letter strings, 
i.e. the law constrains the distribution of all human texts. On the other hand, both Guiraud's 
and Zipf's laws concern the distribution of individual words in texts. Saying that Guiraud's 
law can be deduced from Hilberg's law, we presuppose some procedure which transforms the 
distribution of phoneme strings (i.e. texts) into the corresponding distribution of words. In 
some naive approach, we could assume that the text is a string of phonemes or spaces and the 
words are the space-to-space strings of phonemes. In fact, "intermittent silence" explanation 
assumes that the text is a string of probabilistically independent random tokens taking the values 
of spaces and phonemes. Given this assumption and the space-to-space definition of word,
Mandelbrot deduced Zipf's law, and hence Guiraud's law can be deduced as well (Kornai,
2002).

Unfortunately, the "intermittent silence" explanation cannot be applied to natural language.
We know that the occurrences of phonemes in the language production exhibit some strong
probabilistic dependence and there are no definite spaces between the words in human speech
(Jelinek, 1997). If we want to derive Zipf's law from the distribution of mere phoneme strings,
we must use some definition of word tokens which could be applied to the text deprived of
spaces and which would match empirically the definition of word tokens given by spelling
conventions or by semantic considerations.

Some well-defined tokenization of the space-deprived text into word-like strings can be
given by grammar-based text compression (Kieffer and Yang, 2000). In grammar-based
compression, the text is represented as a special context-free grammar, called an admissible
grammar. That class of context-free grammars should not be confused with phrase structure
grammars: The nonterminals of admissible grammars correspond to fixed strings of phonemes rather
than to part-of-speech classes. Each admissible grammar gives some tokenization of the text
into hierarchically structured word-like strings being the nonterminal tokens. It was empirically
confirmed that for the grammars which approximate the shortest admissible grammar for a
human text, the nonterminals usually correspond to the orthographic words (de Marcken, 1996;
Nevill-Manning, 1996).



We will show that the expected number of nonterminal types for the shortest admissible
grammar cannot be less than proportional to the so-called finite-order excess entropy of the random
text. This is a mathematical result based on a line of theorems and one unproved conjecture.
On the other hand, if Hilberg's hypothesis is true then the finite-order excess entropy of the text
is roughly proportional to the square root of the text length. The close empirical correspondence
between the nonterminals and the orthographic words allows us to claim that Hilberg's law
implies some lower bound for the vocabulary growth, i.e. some form of Guiraud's law.

The rest of this article fills in the details of the deductions and empirical observations
mentioned in the previous paragraphs:

• In section 2, we introduce the definitions of stationary distribution, block entropy,
excess entropy, and infinitary distributions. We sketch the history of Hilberg's law and the
general research of block entropy for natural language production.

• In section 3, we introduce the concepts of admissible and irreducible grammars. We also
discuss some empirical evidence that the shortest admissible grammar matches largely
the linguistic tokenization for the human text.

• In section 4, we relate block entropy to the expected length of irreducible grammar-based
codes. Assuming some mathematical conjecture, we show that the expected total length
of the non-initial productions of the shortest grammar cannot be less than finite-order
excess entropy.

• In section 5, we discuss Guiraud's law in detail and we argue that Hilberg's law explains
it better than the assumption of "intermittent silence". Some arguments for Hilberg's
law explanation are: (i) non-randomness of texts, (ii) empirical detectability of word
boundaries and internal structures, (iii) possibility of explaining Menzerath's law, and
(iv) significant variation of word frequencies across different texts.



2 Excess entropy and Hilberg's law 

Let us imagine some infinite sequence of characters, e.g. 

the_rose_is_a_hose_is_a_rose_is_a_hose_is_a_rose_is_a_hose..., (4)

where subsequence _a_rose_is_a_hose_is is repeated infinitely to fix our imagination. For 
such an (infinite) sequence we can compute the relative frequency of any (finite) string which 
appears in that sequence. 

For example, let us define probability P(rose) as the relative frequency of string rose in
the infinite sequence (4). We shall do it in two steps. Let $a_i$ stand for the $i$-th character of
(4), i.e. $a_1 = \text{t}$, $a_2 = \text{h}$, $a_3 = \text{e}$, $a_4 = \text{\_}$, $a_5 = \text{r}$, etc. We will write the finite substrings as
$a_{m:n} := (a_m, a_{m+1}, \ldots, a_n)$. The relative frequency $P(\text{rose}; n)$ of string rose in string $a_{1:n}$ is the
number of all positions $a_i$, $1 \le i \le n$, where string rose starts divided by $n$. For any equality
relation $\phi$ let us define $[\phi] = 1$ if $\phi$ is true and $[\phi] = 0$ if $\phi$ is false. Thus, $P(\text{rose}; k)$ can be
expressed as

$$P(\text{rose}; k) := \frac{1}{k} \sum_{i=1}^{k} [a_{i:i+3} = \text{rose}], \qquad (5)$$

where := means definition. We have P(rose; 1) = 0, P(rose;5) = 1/5, P(rose; 10) = 1/10, 
P(rose;30) = 2/30 and so on. 

Let us define probability P(rose) as the limit of relative frequencies of string rose in the
initial substrings of (4). So we will write

$$P(\text{rose}) := \lim_{n \to \infty} P(\text{rose}; n). \qquad (6)$$
Every 20th character in sequence (4) is a position where string rose starts, so P(rose) = 1/20.
Analogically, we can define probability $P(v)$ for any string $v$,

$$P(v) := \lim_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} [a_{i:i+\operatorname{len} v - 1} = v], \qquad (7)$$

where $\operatorname{len} v$ is the number of characters in $v$. Hence, for (4) we obtain not only P(t) = 0 (there
are no t's after the first character), P(s) = 1/5 (two in ten characters are s), and P(e) = 1/10 but also
P(e_is_a) = 1/10, P(a_rose_is_a_hose) = 1/20, and P(a_rose_is_a_rose) = 0.
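
The limiting frequencies quoted above can be checked mechanically. The following is a minimal sketch, assuming that for an eventually periodic sequence the limit (7) equals the frequency counted over one period (the finite prefix does not affect the limit). The names `PERIOD` and `prob` are ours and are introduced only for illustration.

```python
from fractions import Fraction

# One period of sequence (4); the non-repeating prefix does not affect limits.
PERIOD = "_a_rose_is_a_hose_is"

def prob(v: str) -> Fraction:
    """Limiting relative frequency P(v) of string v in the periodic tail of (4)."""
    n = len(PERIOD)
    # Unroll enough periods so that a window of len(v) fits at every phase.
    unrolled = PERIOD * (len(v) // n + 2)
    hits = sum(unrolled[i:i + len(v)] == v for i in range(n))
    return Fraction(hits, n)

for s in ("rose", "s", "e", "e_is_a", "a_rose_is_a_hose", "a_rose_is_a_rose"):
    print(s, prob(s))
```

Exact rational arithmetic is used here so that the results can be compared with the fractions in the text without rounding.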

Now let us take some general sequence $(a_1, a_2, a_3, \ldots)$. Let $V$ be the finite set of characters
that appear in that sequence. Let $V^+$ be the set of all finite strings formed by concatenating
the characters in $V$. For any sequence $(a_1, a_2, a_3, \ldots)$ such that limit (7) exists for each string
$v \in V^+$, probability function $P$ satisfies relations

$$0 \le P(v) \le 1, \qquad \sum_{a \in V} P(a) = 1, \qquad \sum_{a \in V} P(av) = P(v) = \sum_{a \in V} P(va). \qquad (8)$$

We will call any function $P$ satisfying conditions (8) for all $v \in V^+$ a stationary distribution.[1] It
is an open question whether for any stationary distribution $P$ there exists such a sequence
$(a_1, a_2, a_3, \ldots)$ that we have (7) for all $v \in V^+$.

Let $V^n$ be the set of all $n$-character long strings. We define block entropy $H(n)$ of any
stationary distribution $P$ as the entropy of strings of length $n$,

$$H(n) := -\sum_{v \in V^n} P(v) \log_2 P(v). \qquad (9)$$

We also put $H(0) := 0$ for algebraic convenience.

For any stationary distribution $P$, block entropy $H(n)$ is a nonnegative, growing, and concave
function of $n$ (Crutchfield and Feldman, 2003), i.e.,

$$H(n) \ge 0, \qquad H'(n) \ge 0, \qquad H''(n) \le 0, \qquad (10)$$

where

$$H'(n) := H(n) - H(n-1), \qquad H''(n) := H(n) - 2H(n-1) + H(n-2). \qquad (11)$$

Because of inequalities (10), we can define entropy rate as

$$h := \lim_{n \to \infty} H(n)/n = \lim_{n \to \infty} H'(n) \ge 0. \qquad (12)$$

If entropy rate satisfies $h > 0$ then $H(n)$ grows almost linearly against the string length $n$ for
very long strings, $H(n) \approx hn$. We can ask how fast $H(n)$ approaches $hn$. The departure of $H(n)$
from the linear growth is known as excess entropy.

Finite-order excess entropies $E(n)$ are some functions of $H(n)$ and $H(2n)$,

$$E(n) := 2H(n) - H(2n) = -\sum_{k=2}^{n} (k-1) H''(k) - \sum_{k=n+1}^{2n} (2n-k+1) H''(k). \qquad (13)$$
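
To make definitions (9) and (13) concrete, here is a minimal sketch that evaluates $H(n)$ and $E(n)$ for the periodic tail of sequence (4). It assumes, as an illustration only, that each of the 20 phases of the period is equally likely, which is the stationary distribution induced by limit (7). The function names are ours.

```python
from collections import Counter
from math import log2

# One period of sequence (4); an assumption made to get an exact distribution.
PERIOD = "_a_rose_is_a_hose_is"

def block_entropy(n: int) -> float:
    """H(n) of definition (9) for the stationary distribution of the periodic tail."""
    if n == 0:
        return 0.0  # convention H(0) := 0
    m = len(PERIOD)
    unrolled = PERIOD * (n // m + 2)
    # Each of the m phases is equally likely, so P(v) = (count of v) / m.
    counts = Counter(unrolled[i:i + n] for i in range(m))
    return -sum(c / m * log2(c / m) for c in counts.values())

def excess_entropy(n: int) -> float:
    """Finite-order excess entropy E(n) = 2 H(n) - H(2n) of definition (13)."""
    return 2 * block_entropy(n) - block_entropy(2 * n)

for n in (1, 2, 5, 10, 20):
    print(n, round(block_entropy(n), 4), round(excess_entropy(n), 4))
```

For this periodic example, $H(n)$ stops growing once $n$ is large enough for all 20 phases to give distinct blocks, so the entropy rate is zero and $E(n)$ saturates at $\log_2 20$. A text obeying Hilberg's law would behave very differently, with $E(n)$ growing without bound.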

[1] Stationary distributions are the distributions of stationary stochastic processes. For simplicity,
we avoid the mathematical terms of stochastic processes, random variables and probabilistic spaces
(Billingsley, 1979; Kallenberg, 1997). Since we do not need these notions to present the core reasonings,
we ignore them to make the article as elementary as possible.
So defined functions are nonnegative and growing, i.e., $E(n) \ge E(n-1) \ge 0$. Crutchfield and Feldman
(2003) proved that (total) excess entropy $E$ can be defined equivalently as

$$E := \lim_{n \to \infty} E(n) = \lim_{n \to \infty} [H(n) - hn]. \qquad (14)$$

We also have inequality

$$E = -\sum_{k=2}^{\infty} (k-1) H''(k) \ge -\sum_{k=2}^{\infty} H''(k) = H(1) - h. \qquad (15)$$

Let $vu$ be the concatenation of strings $v$ and $u$. We will say that stationary distribution $P$ is
an IID distribution if

$$P(vu) = P(v)P(u) \qquad (16)$$

for all strings $v, u \in V^+$. (IID stands for independent identically distributed random variables.)
Distributions $P$ can be IID even for some quite ordered underlying sequences $(a_1, a_2, a_3, \ldots)$. For
instance, $P$ given through (7) is IID for the sequence of digits of consecutive natural numbers
$(a_1, a_2, a_3, \ldots) = (1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 0, 1, 1, \ldots)$, which is called the Champernowne sequence
(Li and Vitányi, 1993). Anyway, we do not expect that we could obtain an IID distribution $P$ if
we substituted some collection of human texts for sequence $(a_1, a_2, a_3, \ldots)$.

For any IID distribution $P$ we have $H(n) = nH(1)$ so $h = H(1)$ and $E = 0$. Conversely, if
$H(1) - h > 0$ or $E > 0$, then distribution $P$ cannot be IID. For the extreme departures from the
IID case, we have $h = 0$ or $E = \infty$. Stationary distributions exhibiting $h = 0$ are called
deterministic while the distributions obeying $E = \infty$ are called infinitary (Crutchfield and Feldman,
2003). In appendix B, we present some properties of infinitary distributions which could be
important for their possible applications in quantitative and computational linguistics but which
are not so relevant for the main reasoning of this article.

Let us assume that we could obtain some definite stationary distribution $P$ through formula
(7) if we substituted the infinite concatenation of some human texts for $(a_1, a_2, a_3, \ldots)$. We
will call such an infinite sequence $(a_1, a_2, a_3, \ldots)$ natural language production. Research in the
hypothetical stationary distribution of natural language production has attracted many scientists.
The first one to work in this area was Shannon (1950). He tried to estimate block entropy
using the guessing method and assuming some correspondence between particular instances of
English texts and the hypothetical random English language production. Shannon published
some estimates of $H(n)$ for strings of $n$ consecutive letters, where $n \le 100$.

Shannon was not convinced of any particular asymptotics of block entropy $H(n)$ for the
natural language production (Hilberg, 1990) but the later researchers in quantitative linguistics
tried to model $H(n)$ by some simple formulae. For example, Hoffmann and Piotrovskij (1979)
proposed a model of exponential convergence,

$$H(n)/n = (h_0 - h) \exp[-n/n_0] + h. \qquad (17)$$



Petrova (1973) fitted model (17) to French language data and obtained $1/n_0 \in (0.24, 0.33)$.

On the other hand, Hilberg (1990) replotted the original plot of $H(n)$ vs. $n$ by Shannon
(1950) into a log-log scale and observed that a simple square-root dependence fits all the data
points,

$$H(n) \propto n^{\mu}, \qquad \mu \approx 1/2, \qquad n \le 100. \qquad (18)$$
For our convenience, we will call Hilberg's law an algebraic relation which is slightly more
general than Hilberg's original hypothesis (18). We will say that Hilberg's law holds for a
stationary distribution $P$ if relation (1) holds with $\mu \approx 1/2$ and $h_\mu > 0$ for any $n$. With this
definition, Hilberg's law is independent of any hypothesis on the particular value of entropy rate
$h$ and the constant term $h_0$.

While Shannon estimated block entropy using the guessing method, Ebeling and his
collaborators tried to estimate the asymptotics of $H(n)$ by counting $n$-tuples in the samples of various
symbolic sequences. Using improved entropy estimators, the researchers fitted the general
formula (1) with $\mu \approx 1/2$ for natural language texts and $\mu \approx 1/4$ for classical music transcripts.
For English and German texts $H(n)$ could be safely estimated for $n \le 30$ characters with $h_0 \approx 0$,
$h_\mu \approx 3.1$ bits and $h \approx 0.4$ bits (Ebeling and Nicolis, 1992; Ebeling and Pöschel, 1994). In
contrast, Shannon's guessing data, reinterpreted by Hilberg (1990), suggest that equation (1) can
be extrapolated at least for $n \le 100$.

It is important to note that the estimation of block entropy $H(n)$ based on the naive
estimation of probabilities $P(v)$ for all strings $v$ of length $n$ is expensive in the input data. In order to
estimate the value of $H(n)$, we need a sample of length about $2^{H(n)}$ (Herzel et al., 1994). If we
try to make shortcuts, we assume some particular properties of the unknown stationary
distribution $P$. Even Shannon's (1950) guessing method need not give reliable estimates of $H(n)$
for the language production if the probabilistic language model internalized by the experimental
subjects differs from the model estimated from the corpus (Bod et al., 2003; Hug, 1997).



Let us note that for the block entropy of formula (1), finite-order excess entropies are

$$E(n) \approx h_0 + (2 - 2^{\mu}) h_\mu n^{\mu}. \qquad (19)$$

If relation (19) holds with $0 < \mu < 1$ for any $n$ then the total excess entropy is $E = \infty$. Hence,
every stationary distribution exhibiting Hilberg's law is infinitary.
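
The cancellation of the linear term in the finite-order excess entropy can be verified numerically. The sketch below assumes block entropy of exactly the Hilberg form, with default parameters set to the rounded fits for English and German quoted earlier ($h_0 \approx 0$, $h_\mu \approx 3.1$ bits, $h \approx 0.4$ bits, $\mu \approx 1/2$). The function names are ours.

```python
def hilberg_entropy(n, h0=0.0, h_mu=3.1, mu=0.5, h=0.4):
    """Block entropy of the form h0 + h_mu * n**mu + h * n (bits)."""
    return h0 + h_mu * n ** mu + h * n

def excess_from_definition(n, **params):
    """E(n) = 2 H(n) - H(2n), computed directly from the definition."""
    return 2 * hilberg_entropy(n, **params) - hilberg_entropy(2 * n, **params)

def excess_closed_form(n, h0=0.0, h_mu=3.1, mu=0.5):
    """Closed form h0 + (2 - 2**mu) * h_mu * n**mu; the entropy rate h drops out."""
    return h0 + (2 - 2 ** mu) * h_mu * n ** mu

for n in (10, 100, 1000, 10000):
    print(n, excess_from_definition(n), excess_closed_form(n))
```

Both columns agree up to rounding and grow like $n^{\mu}$, which illustrates why the total excess entropy diverges whenever $0 < \mu < 1$ and $h_\mu > 0$.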

At the moment, we have no clear idea how one could verify if Hilberg's law holds for the 
hypothetical stationary distribution of the language production. Nevertheless, we can provide 
some mixed inductive and deductive arguments that Hilberg's law implies some phenomena 
that can be observed in human language. 



3 Words and the shortest grammars 

In the following sections, we shall argue that Hilberg's law can explain some quantitative laws 
concerning the distribution of word types in the language production. Nevertheless, before we 
can speak of any distribution of words in a finite string of phonemes or letters, we need to 
delimit the word tokens themselves. If the words are some objective entities of the language, 
there should be some method for identifying the boundaries between the words in a sufficiently 
long string of phoneme or letter tokens even if we delete the spaces between words and ignore 
the lexicon. 

Let us take some text deprived of spaces, e.g. 

v = shouldawoodchuckchuckifawoodchuckcouldchuckwood. (20)

We can express our knowledge of word tokens describing string v by means of a two-level 
context-free grammar 

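$$G = \left\{ \begin{array}{lll} b_0 \mapsto b_5 b_1 b_7 b_6 b_2 b_1 b_7 b_4 b_6 b_3, & b_1 \mapsto \text{a}, & \\ b_2 \mapsto \text{if}, & b_3 \mapsto \text{wood}, & b_4 \mapsto \text{could}, \\ b_5 \mapsto \text{should}, & b_6 \mapsto \text{chuck}, & b_7 \mapsto \text{woodchuck} \end{array} \right\}. \qquad (21)$$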
Symbols $b_i$ are called nonterminals. For each $b_i$ there is some production rule $(b_i \mapsto g_i) \in G$.
On the other hand, the typewriter-typed symbols, which have no production rules in the
grammar, will be called terminals. Nonterminal $b_0$ is called the initial symbol of the grammar.
If we recursively substitute productions $g_i$ for all nonterminals $b_i$ where $(b_i \mapsto g_i) \in G$, then $b_0$
expands into string $v$ with the requested tokenization into the words. Namely,

$$v = \underline{\text{should}}\,\underline{\text{a}}\,\underline{\text{woodchuck}}\,\underline{\text{chuck}}\,\underline{\text{if}}\,\underline{\text{a}}\,\underline{\text{woodchuck}}\,\underline{\text{could}}\,\underline{\text{chuck}}\,\underline{\text{wood}},$$

where notation $\underline{g}$ means that $G$ contains rule $b_i \mapsto g$ for some $i \ne 0$ (de Marcken, 1996).



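Of course, if we were not given any previous knowledge of English lexicon, we could
propose other tokenizations for text (20). For instance,

$$\left\{ \begin{array}{ll} b_0 \mapsto \text{sh}\, b_1 b_4 b_2\, \text{if}\, b_4\, \text{c}\, b_1 b_2 b_3, & b_1 \mapsto \text{ould}, \\ b_2 \mapsto \text{chuck}, \quad b_3 \mapsto \text{wood}, & b_4 \mapsto \text{a}\, b_3 b_2 \end{array} \right\} \qquad (22)$$

yields

$$v = \text{sh}\,\underline{\text{ould}}\,\underline{\text{a}\,\underline{\text{wood}}\,\underline{\text{chuck}}}\,\underline{\text{chuck}}\,\text{if}\,\underline{\text{a}\,\underline{\text{wood}}\,\underline{\text{chuck}}}\,\text{c}\,\underline{\text{ould}}\,\underline{\text{chuck}}\,\underline{\text{wood}}.$$

In the extreme, we could define $b_0$ as the entire string $v$ or each $b_i$, $i \ne 0$, as a single letter. Since
we ignore English lexicon, we need some purely formal criterion for deciding which grammars
$G$ are good for arbitrary strings $v$ and which are not.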

Let us state some formal definitions. Context-free grammar $G$ will be called a grammar
(more precisely, admissible grammar) for string $v$ (cf. Kieffer and Yang, 2000) if:

1. For each nonterminal $b_i$ there is exactly one production $g_i$ such that $(b_i \mapsto g_i) \in G$.

2. Nonterminal $b_0$ expands into $v$ if we recursively substitute productions $g_i$ for all $b_i$.

The set of all admissible grammars for $v$ will be denoted by $F(v)$. Each grammar $G \in F(v)$ is
allowed to produce only one derivation, which is the finite text $v$ itself. In contrast, context-free
grammars producing a single infinite derivation are known as L-systems.

Some a priori criterion for deciding which admissible grammars approximate the correct
tokenizations of texts makes use of the principle of minimum description length (Rissanen,
1978; Lehman and Shelat, 2002). Define the length $\operatorname{len} g_i$ of production $g_i$ as the total number
of its terminal and nonterminal symbols, e.g. $\operatorname{len} \text{sh}\, b_1 b_4 b_2\, \text{if}\, b_4\, \text{c}\, b_1 b_2 b_3 = 12$ and $\operatorname{len} \text{a}\, b_3 b_2 = 3$.
According to the principle of minimum description length, the best grammar for string $v$ is
grammar $G^{\mathrm{MDL}}(v)$ having the minimal length,

$$G^{\mathrm{MDL}}(v) := \operatorname*{arg\,min}_{G \in F(v)} \operatorname{len} G, \qquad (23)$$

where the length of a grammar is the total length of all its productions,

$$\operatorname{len} G := \sum_{(b_i \mapsto g_i) \in G} \operatorname{len} g_i. \qquad (24)$$

Strictly speaking, there can be more than one grammar having the minimal length, so object
$G^{\mathrm{MDL}}(v)$ is slightly indeterminate.
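
The two example grammars and the length definition (24) can be made concrete with a short sketch. The encoding is our own assumption: integer $i$ stands for nonterminal $b_i$, single characters are terminals, and `G21` and `G22` transcribe grammars (21) and (22).

```python
# Nonterminals are integers (i stands for b_i); terminals are single characters.
G21 = {0: [5, 1, 7, 6, 2, 1, 7, 4, 6, 3],
       1: list("a"), 2: list("if"), 3: list("wood"), 4: list("could"),
       5: list("should"), 6: list("chuck"), 7: list("woodchuck")}
G22 = {0: ["s", "h", 1, 4, 2, "i", "f", 4, "c", 1, 2, 3],
       1: list("ould"), 2: list("chuck"), 3: list("wood"), 4: ["a", 3, 2]}

def expand(grammar, symbol=0):
    """Recursively substitute productions until only terminals remain."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])

def grammar_length(grammar):
    """len G of (24): total number of symbols on all right-hand sides."""
    return sum(len(rhs) for rhs in grammar.values())

v = "shouldawoodchuckchuckifawoodchuckcouldchuckwood"
for name, g in (("(21)", G21), ("(22)", G22)):
    print(name, expand(g) == v, grammar_length(g))
```

Both grammars are admissible for $v$, and both are shorter than the 47 characters of $v$ itself. Grammar (22) has 28 symbols against 42 for grammar (21), so criterion (23) would prefer it even though its tokenization is less word-like on this very short text.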

Grammar $G^{\mathrm{MDL}}(v)$ usually cannot be computed in a reasonable amount of time but there
is a multitude of heuristic algorithms which compute grammars whose lengths approximate
$\operatorname{len} G^{\mathrm{MDL}}(v)$ (Lehman, 2002; Lehman and Shelat, 2002). Various algorithms for computing the
approximations of $G^{\mathrm{MDL}}(v)$ usually perform some kind of local search on set $F(v)$ and output
so-called irreducible grammars. Grammar $G$ is called irreducible (Kieffer and Yang, 2000, section
3.2) if:

1. Each nonterminal expands recursively into a different string of terminals.

2. Each nonterminal except for $b_0$ appears at least twice in productions $g_i$.

3. There is no string $\gamma$ of $\operatorname{len} \gamma \ge 2$ which appears more than once in productions $g_i$.

It can be shown that there is an irreducible grammar for $v$ whose length equals $\min_{G \in F(v)} \operatorname{len} G$.
Hence, we can assume that $G^{\mathrm{MDL}}(v)$ is irreducible.
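
To give a feeling for what a heuristic local search over admissible grammars may look like, here is a minimal sketch of a greedy pair-replacement procedure in the spirit of Re-Pair. It is not one of the algorithms evaluated in the cited works, and its output is an admissible grammar that is not guaranteed to be irreducible or shortest. All names, including the tuple encoding of nonterminals, are our own assumptions.

```python
from collections import Counter

def replace_pair(seq, pair, new):
    """Replace non-overlapping occurrences of `pair` in `seq`, left to right."""
    out, i, n = [], 0, 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new)
            i += 2
            n += 1
        else:
            out.append(seq[i])
            i += 1
    return out, n

def greedy_grammar(text):
    """Return (initial production, rules) found by greedy pair replacement."""
    seq, rules = list(text), {}
    while True:
        counts = Counter(zip(seq, seq[1:]))
        for pair, count in counts.most_common():
            if count < 2:
                return seq, rules
            new = ("b", len(rules) + 1)      # hypothetical nonterminal name
            out, n = replace_pair(seq, pair, new)
            if n >= 2:                       # a new rule must be used at least twice
                rules[new] = list(pair)
                seq = out
                break
        else:
            return seq, rules

def expand(symbol, rules):
    """Expand a symbol into its string of terminals."""
    if isinstance(symbol, str):
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])

v = "shouldawoodchuckchuckifawoodchuckcouldchuckwood"
g0, rules = greedy_grammar(v)
length = len(g0) + sum(len(rhs) for rhs in rules.values())
print(length, len(v))
```

Each accepted replacement shortens the initial production by at least as many symbols as the new rule adds, so the grammar length never increases. Stronger heuristics differ mainly in how they choose and revise the substitutions.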

Various algorithms for computing the irreducible approximations of $G^{\mathrm{MDL}}(v)$ have been
tested empirically on natural language data. Wolff (1980), Nevill-Manning (1996), and de Marcken
(1996) reported that those algorithms return quite sound representations of English texts. The
nonterminals of some irreducible approximations of $G^{\mathrm{MDL}}(v)$ can be interpreted as syllables,
morphemes, words, and fixed phrases. Some of the heuristic algorithms identify the correct
boundaries of about 90% of orthographic words in the Brown corpus, in a text deprived of
spaces, capitalization, and punctuation (de Marcken, 1996). Here is an example of the
computed tokenization given by de Marcken:

    for the pur pose of maintain ing inter nation al
    peace and pro mot ing the advance ment of all
    people the united states ofamerica joined
    in f ound ing the un it ed nat i on s .

The results of the automatic tokenization are especially impressive for strongly isolating
languages, such as English and Chinese (de Marcken, 1996). The same algorithms need not
be so effective for highly inflective languages, where numerous orthographic alternations occur
within the morphological stems (e.g. for Polish). The pursuit of better tokenization algorithms
cannot be separated from the quest for the data compression algorithms which identify the
inflectional paradigms (Goldsmith, 2001) or the abstract phrase syntax structures (Nowak et al.).



4 The shortest grammar and excess entropy 

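Let us denote the set of the non-initial rules of grammar $G$ as $G_0 := G \setminus \{b_0 \mapsto g_0\}$, where
$A \setminus B$ is the difference of sets $A$ and $B$. We will call $G_0$ the vocabulary of $G$. The length of the
vocabulary is defined as

$$\operatorname{len} G_0 := \sum_{(b_i \mapsto g_i) \in G_0} \operatorname{len} g_i = \operatorname{len} G - \operatorname{len} g_0. \qquad (25)$$

We use notation $\operatorname{len} G_0^{\mathrm{MDL}}(v) := \operatorname{len} G^{\mathrm{MDL}}(v) - \operatorname{len} g_0^{\mathrm{MDL}}(v)$ respectively.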

If the average length of the word-like productions $g_i$, $i \ne 0$, does not depend significantly on
the text then we may suppose that $\operatorname{len} G_0^{\mathrm{MDL}}(v)$ is proportional to the number of word types in text
$v$. In fact, we can observe an analog of Guiraud's law (2). If we look at the data published by
Nevill-Manning (1996, figure 3.12 (b), p. 69), we can observe empirical proportionality

$$\operatorname{len} G_0^{\mathrm{SEQUITUR}}(v) \propto (\operatorname{len} v)^{\alpha}, \qquad (26)$$

where $1/2 < \alpha < 1$ and $G_0^{\mathrm{SEQUITUR}}(v)$ is some approximation of $G_0^{\mathrm{MDL}}(v)$ computed by the
algorithm called SEQUITUR.

In this section, we would like to present some general theoretical result. We shall relate the
length of $G_0^{\mathrm{MDL}}(v)$ to the finite-order excess entropy. It is well known that there are intimate
relations between block entropy and the expected lengths of some codes used in data
compression. In particular, Kieffer and Yang (2000) discuss the concept of grammar-based codes, which
represent strings $v \in V^+$ as uniquely decodable binary strings $C(v) \in \{0, 1\}^+$ by the mediation
of the admissible grammars.

Let $F = \bigcup_{v \in V^+} F(v)$ be the set of admissible grammars for all strings. Function $C : V^+ \to \{0, 1\}^+$
is called a grammar-based code if

$$C(v) = B(G^{C}(v)), \qquad (27)$$

where grammar transform $G^{C}$ computes grammar $G^{C}(v) \in F(v)$ and grammar encoder $B$
represents any grammar $G \in F$ as a unique binary string $B(G) \in \{0, 1\}^+$.

Let us introduce the expected length of code $C$ for the strings of length $n$ drawn from
stationary distribution $P$,

$$H^{C}(n) := \sum_{v \in V^n} P(v) \cdot \operatorname{len} C(v). \qquad (28)$$

Code $C$ is called universal (more precisely, weakly minimax universal) if

$$H^{C}(n) \ge H(n), \qquad (29)$$
$$\lim_{n \to \infty} H^{C}(n)/n = \lim_{n \to \infty} H(n)/n \qquad (30)$$

for any stationary distribution $P$. See Cover and Thomas (1991, sections 5.1-6 and 12.10) for
a general background in information and coding theory.

Additionally, let us call $C$ an irreducible code if for each input string $v \in V^+$, grammar
$G^{C}(v)$ is irreducible. Kieffer and Yang (2000, theorem 8) prove the following result:

Theorem 1 There exists such a grammar encoder $B$ that any irreducible code of form (27) is
weakly minimax universal.

It is a very strong and profound theorem. In particular, code $\mathrm{MDL}(v) := B(G^{\mathrm{MDL}}(v))$ is
universal since the shortest grammar $G^{\mathrm{MDL}}(v)$ is irreducible. Theorem 1 can be used to prove
universality of the modified SEQUITUR code by Nevill-Manning (Kieffer and Yang, 2000, section
6.2). Universality of the famous Lempel-Ziv code, however, is proved differently since it is not
an irreducible code and it uses a different grammar encoder (Cover and Thomas, 1991, section
12.10).



It has been checked empirically that codes whose grammars are shorter usually enjoy shorter
lengths. For instance, Grassberger (2002) compressed 135 GB of English text and obtained
compression rates (in bits per character) $\operatorname{len} \mathrm{LZ}(v)/\operatorname{len} v \approx 2.6$ for Lempel-Ziv code LZ and
$\operatorname{len} \mathrm{NSRPS}(v)/\operatorname{len} v \approx 1.8$ for some heuristic irreducible code NSRPS. Other researchers
reported comparable results (de Marcken, 1996).

By analogy to definition (13) of finite-order excess entropy $E(n)$, let us introduce the
expected excess code length

$$E^{C}(n) := 2H^{C}(n) - H^{C}(2n) = \sum_{v, u \in V^n} P(vu) \left[ \operatorname{len} C(v) + \operatorname{len} C(u) - \operatorname{len} C(vu) \right]. \qquad (31)$$

Theorem 2 For any weakly minimax universal code $C$ inequality

$$E^{C}(n) \ge E(n) \qquad (32)$$

is true for infinitely many $n$. (See appendix A for the proof.)

Inequality (32) is valid in particular for $C = \mathrm{MDL}$ or for any irreducible code.

Now, we shall link the expected excess code length $E^{\mathrm{MDL}}(n)$ with the length of MDL
vocabulary. Let $L^{m}(v) := \operatorname{len} G^{\mathrm{MDL}}(v)$ be the length of the shortest grammar and
$L_0^{m}(v) := \operatorname{len} G_0^{\mathrm{MDL}}(v)$ be the length of its vocabulary. Define $L^{>1}(v)$ as the maximal length of a string
which appears in string $v$ at least twice.

Theorem 3 We have inequalities

$$L^{m}(v) \le \operatorname{len} v, \qquad (33)$$
$$L^{m}(v),\, L^{m}(u) \le L^{m}(vu) + L^{>1}(vu), \qquad (34)$$
$$0 \le L^{m}(v) + L^{m}(u) - L^{m}(vu) \le L_0^{m}(vu) + L^{>1}(vu). \qquad (35)$$

(See the appendix for the proof.)

Inequality (35) states that the vocabulary length for the shortest grammar cannot be roughly
less than the excess length of the shortest grammar. In a slightly heuristic reasoning, we shall
argue that the excess length of the shortest grammar multiplied by a slowly growing function
cannot be less than the excess length of code MDL. In order to do it we need some pretty strong
symmetrical bound for the length of code MDL in terms of the length of the shortest grammar.

It is known that function $B$ of Theorem 1 satisfies $\operatorname{len} B(G) \le \gamma(\operatorname{len} G)$, where
$\gamma(n) := n \cdot (c + \log n)$ for some constant $c$ (Kieffer and Yang, 2000, section 4). The following symmetrical
bound for code MDL seems probable:

Conjecture 4 There is inequality

$$\left| \operatorname{len} \mathrm{MDL}(v) - \gamma(L^{m}(v)) \right| \le f_2(L^{m}(v)), \qquad (36)$$

where $\gamma(n) := n \cdot f_1(n)$ and functions $f_i \ge 0$ satisfy $0 \le f_i(n+1) - f_i(n) \le c_i/n$ for some
constants $c_i$.

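Now we can give a bound for the excess length of code MDL in terms of the excess length
of the shortest grammar.

Theorem 5 If Conjecture 4 is true then

$$\operatorname{len} \mathrm{MDL}(v) + \operatorname{len} \mathrm{MDL}(u) - \operatorname{len} \mathrm{MDL}(vu) \le \left[ L^{m}(v) + L^{m}(u) - L^{m}(vu) + d_1 \right] \left[ d_2 + c_1 \log \operatorname{len} vu + c_1 \frac{L^{>1}(vu)}{L^{m}(vu)} \right] + c_1 L^{>1}(vu), \qquad (37)$$

where $d_1 = 3c_2/c_1$ and $d_2 = \max(f_1(1), f_2(1) c_1/c_2)$. (See the appendix for the proof.)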

Recall that $H^{\mathrm{MDL}}(n)/n = \sum_{v \in V^n} P(v) \cdot L^{m}(v) / \operatorname{len} v$ approaches entropy rate $h$ for $n \to \infty$ by
Theorem 1. We may speculate that $h > 0$ for the language production. Let us assume a stronger
statement, namely, that

$$\operatorname{len} v \le d_3 L^{m}(v) \qquad (38)$$

for some constant $d_3$ and (almost) every human text $v$. On the other hand, notice that
$L^{>1}(v) \le \operatorname{len} v$ follows by definition of $L^{>1}(v)$. By these two inequalities, we have $L^{>1}(vu)/L^{m}(vu) \le d_3$.
Combining the latter with (37) and (35) gives

$$\operatorname{len} \mathrm{MDL}(v) + \operatorname{len} \mathrm{MDL}(u) - \operatorname{len} \mathrm{MDL}(vu) \le \left[ L_0^{m}(vu) + L^{>1}(vu) + d_1 \right] \left[ d_4 + c_1 \log \operatorname{len} vu \right], \qquad (39)$$

where $d_4 := d_2 + c_1 (d_3 + 1)$. Averaging (39) with $P(vu)$ for $v, u \in V^n$, we obtain

$$\left[ L_0^{m}[2n] + L^{>1}[2n] + d_1 \right] \left[ d_4 + c_1 \log(2n) \right] \ge E^{\mathrm{MDL}}(n), \qquad (40)$$

where

$$L_0^{m}[n] := \sum_{v \in V^n} P(v) \cdot L_0^{m}(v), \qquad L^{>1}[n] := \sum_{v \in V^n} P(v) \cdot L^{>1}(v). \qquad (41)$$

By inequality (40) and Theorem 2 we also have

$$\left[ L_0^{m}[2n] + L^{>1}[2n] + d_1 \right] \left[ d_4 + c_1 \log(2n) \right] \ge E(n) \qquad (42)$$

for infinitely many $n$. In particular, if stationary distribution $P$ obeys Hilberg's law then
inequality

$$L_0^{m}[n] + L^{>1}[n] \ge \mathrm{const} \cdot \frac{n^{\mu}}{\log n} \qquad (43)$$

holds for infinitely many $n$ by equation (19).
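
The last step can be spelled out as follows. This is our own sketch of the omitted algebra, not part of the original derivation: $c$ denotes an unspecified positive constant, relation (19) is taken to hold for all $n$, and $c_1 > 0$ is assumed so that the logarithmic factor is positive for large $n$.

```latex
\begin{align*}
L_0^{m}[2n] + L^{>1}[2n]
  &\ge \frac{E(n)}{d_4 + c_1 \log(2n)} - d_1
  && \text{by (42), for infinitely many } n, \\
  &\approx \frac{h_0 + (2 - 2^{\mu})\, h_\mu\, n^{\mu}}{d_4 + c_1 \log(2n)} - d_1
  && \text{by (19)}, \\
  &\ge c\, \frac{(2n)^{\mu}}{\log(2n)}
  && \text{for all sufficiently large such } n .
\end{align*}
```

The final line uses the fact that $n^{\mu}/\log n$ grows without bound, so the constants $h_0$, $d_1$ and $d_4$ can be absorbed into $c$. Renaming $2n$ as $n$ gives the stated bound along an infinite sequence of text lengths.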

5 Hilberg's law and Guiraud's law 

In this section, we would like to make the final step in deriving Guiraud's law from relation
(43). First, let us have a closer look at Guiraud's and Zipf's laws. It is widely known that if
Zipf's law (3) holds with the same $B$ for all $N$ then Guiraud's law (2) is satisfied with $\rho = 1/B$
for large $N$, cf. Kornai (2002, section 3.2) or Ferrer i Cancho and Solé (2001).



In fact, the number of word types $V$ and the number of word tokens $N$ can be computed
given the word frequencies,

$$V = \sum_{w:\, c(w) > 0} 1, \qquad N = \sum_{w:\, c(w) > 0} c(w), \qquad (44)$$

so any relation between $V$ and $N$ is a function of the exact distribution of frequencies $c(w)$.
The converse is not true. In general, frequency $c(w)$ cannot be computed given only $w$, $V$, and
$N$ since different texts usually have different keywords. Still, we may seek hypothetical
derivations of formula (3) given formula (2) and some additional assumptions.
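
The quantities in (44) and the rank-frequency list used in Zipf's law are straightforward to compute from a tokenized text. The following is a minimal sketch in which the function names and the whitespace tokenization of the example sentence are our own choices for illustration.

```python
from collections import Counter

def types_and_tokens(words):
    """Return (V, N) of (44): number of word types and of word tokens."""
    c = Counter(words)                       # c[w] is the frequency c(w)
    V = sum(1 for w in c if c[w] > 0)
    N = sum(c[w] for w in c if c[w] > 0)
    return V, N

def rank_frequency(words):
    """List of (rank r(w), word w, count c(w)), sorted by descending count."""
    ordered = Counter(words).most_common()
    return [(r, w, n) for r, (w, n) in enumerate(ordered, start=1)]

text = "should a woodchuck chuck if a woodchuck could chuck wood".split()
print(types_and_tokens(text))
for row in rank_frequency(text):
    print(row)
```

The computation runs only one way: $V$ and $N$ follow from the counts, whereas the counts of individual words cannot be recovered from $V$ and $N$ alone.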

One could ask if Guiraud's law or Zipf's law hold with the same $\rho$ or $B$ for texts of various
size and origin. The answer is complex. For instance, Kornai (2002, section 2.5) discusses
Guiraud's law extensively and according to the plot in his article value $\rho \approx 0.75$ holds perfectly
for samples of sizes $N \in [1.4 \cdot 10^5, 1.8 \cdot 10^7]$ drawn from San Jose Mercury News corpus. Such
value of $\rho$ would correspond to $B \approx 1.33$ if formula (3) with constant $B$ held for all word ranks.
Nevertheless, if we investigate the rank-frequency plot for such large collections of texts, we
encounter a different regularity.



Ferrer i Cancho and Solé (2001) discovered that parameter B in Zipf's formula depends on
word rank r(w). For multi-author corpora there are two regimes where B is almost constant.
Namely, we have

$$B = \begin{cases} B_1, & 0 < r(w) < R_1, \\ B_2, & R_1 < r(w), \end{cases} \qquad (45)$$

where $B_2 > B_1 \approx 1$. Let us note that for sufficiently short text collections (those with
$V < R_1$) only one of two regimes can be observed. For single-author corpora and $r(w) > R_1$,
we have an exponential decay of c(w) rather than a power law.

In another case of some multi-author collection of English texts counting $1.8 \cdot 10^8$ word
tokens, Montemurro and Zanette (2002) reported $B_1 \approx 1$, $B_2 \approx 2.3$ and
$R_1 \approx 6000$. The investigated collection is only 10 times larger than SJMN corpus surveyed
by Kornai. If Zipf's formula with constant $B \approx 2.3$ held for all word ranks then we would
have Guiraud's law (2) with $\rho \approx 0.43$. Anyway, if there are two regimes of B, like in
(45), then we could obtain Guiraud's law (2) with $\rho \approx 0.75$ for all N if also parameter
$R_1$ depends on the text length N. Until we have more experimental data on the dependence
between N and $R_1$, we can be only sure that there is inequality

$$V \ge \mathrm{const} \cdot N^{0.43}. \qquad (46)$$
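
The correspondence $\rho = 1/B$ used above can be checked numerically. The sketch below assumes
an idealized rank-frequency list c(r) = floor(C / r^B) with B > 1; this form, the constants, and
the function names are assumptions chosen for illustration, not a model of any corpus.

```python
import math

def types_and_tokens_zipf(C, B):
    """V and N for the idealized counts c(r) = floor(C / r**B), r = 1, 2, ..."""
    V = 0
    N = 0
    r = 1
    while True:
        count = int(C / r ** B)
        if count < 1:           # ranks beyond this point do not occur in the text
            break
        V += 1
        N += count
        r += 1
    return V, N

def guiraud_exponent(B, C_small=1e5, C_large=1e8):
    """Slope of log V against log N between two idealized text sizes."""
    V1, N1 = types_and_tokens_zipf(C_small, B)
    V2, N2 = types_and_tokens_zipf(C_large, B)
    return math.log(V2 / V1) / math.log(N2 / N1)

for B in (1.33, 2.3):
    # Compare the predicted exponent 1/B with the measured slope.
    print(B, round(1 / B, 2), round(guiraud_exponent(B), 2))
```

Under these assumptions the measured slope stays close to 1/B for both values of B quoted in the
text, i.e. close to 0.75 and 0.43 respectively.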

Let V(v) be the number of orthographic word types in text v and N(v) the number of
orthographic word tokens therein. If we assume that the mean length of the word tokens in text
v does not change substantially with v then text length N(v) measured in orthographic words is
proportional to text length len v measured in phonemes or letters,

$$N(v) \propto \mathrm{len}\, v. \qquad (47)$$

In view of section 3, we may suppose that the number of orthographic word types V(v)
is proportional to the number of the production rules in the shortest grammar
$G_{\mathrm{MDL}}(v)$, cf. Nevill-Manning (1996, figure 3.12 (c) vs. (a), p. 69). If the mean
length of the non-initial productions does not change substantially against v then the number of
the rules is proportional to length $L^0(v)$ of the vocabulary of the shortest grammar
$G_{\mathrm{MDL}}(v)$, cf. Nevill-Manning (1996, figure 3.12 (a) vs. (b), p. 69). To sum up, we
would have proportionality

$$V(v) \propto L^0(v). \qquad (48)$$

Assuming relations (47) and (48), we can restate Guiraud's law (46) as

$$L^0(v) \ge \mathrm{const} \cdot (\mathrm{len}\, v)^{0.43}, \qquad (49)$$

which resembles relation (26) reported by Nevill-Manning. Except for the effects of averaging
and the negligible length $L^{>1}(v)$ of the longest substring appearing more than once,
inequality (49) is implied by inequality (43) with the very rough estimate $\mu \approx 1/2$ done
by Hilberg. We could say that Hilberg's law can be some explanation of Guiraud's law. Let us
discuss the plausibility of such explanation.
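
To see why the estimate of 1/2 for the exponent in (43) suffices for the exponent 0.43 in (49),
note that n^0.5 / log n equals n^0.43 multiplied by n^0.07 / log n, and the latter factor stays
bounded away from zero for n > 1. A minimal numeric check follows; it assumes the natural
logarithm (another base only changes the constant), and the function name is hypothetical.

```python
import math

def ratio(n, mu=0.5, rho=0.43):
    """n**mu / log(n) divided by n**rho, for n > 1."""
    return n ** (mu - rho) / math.log(n)

# Calculus gives the minimum at log n = 1 / (mu - rho), with value e * (mu - rho).
analytic_min = math.e * (0.5 - 0.43)

# Scan n = exp(x) on a fine grid of x to confirm the bound numerically.
numeric_min = min(ratio(math.exp(x / 100.0)) for x in range(1, 20001))
print(analytic_min, numeric_min)
```

Since the factor never drops below roughly 0.19, a lower bound of the form const · n^0.5 / log n
yields a lower bound const' · n^0.43 with a smaller constant.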

Zipf's law is often understood as a specific algebraic relationship between the counts and
ranks of various objects, not necessarily words. In such generalization, Zipf's law is observed
also out of the linguistic domain, e.g. in income distribution (Pareto, 1897). We do not know if
one can find a general explanation of Zipf's law both in linguistic and non-linguistic contexts.
Explaining Zipf's law in the purely linguistic context seems somehow easier. One needs "only"
to assign some reasonable relative frequency P(v) to every string v of phonemes and then to
define how any finite string v should be cut into words. The existence or nonexistence of Zipf's
relation should follow by pure mathematical deduction from these two assumptions.

That idea inspired Mandelbrot (1953) to formulate some classical explanation of Zipf's law.
His assumptions are: 

1. Stationary distribution P is an IID distribution, i.e. it satisfies (fTBT) . 
2. Set V of atomic symbols is the set of phonemes and spaces. The word tokens in any text
are defined as the space-to-space strings of phonemes. 

Given these assumptions Mandelbrot derived Zipf's law for space-to-space words and hence 
Guiraud's law can be inferred as well. In fact, Mandelbrot did not discuss Guiraud's law but, 
as we have said, Zipf's law does imply Guiraud's law automatically. Mandelbrot's explanation
assuming the existence of "intermittent silences" was quoted or rediscovered by many
researchers, e.g. by Belevitch (1956), Miller (1957), Bell et al. (1990) and Li (1992). There is
some historical summary of that literature done by Li (1998).

Although Mandelbrot's explanation of Zipf's and Guiraud's laws earned some popularity 
among natural scientists, we should stress that both of its assumptions are false with respect to 
the intended application to natural language. First, we would object to modeling human
language production by an IID distribution. Second, Mandelbrot's definition of word is biased by
the spelling conventions of the most popular alphabetic scripts which use blank spaces to
separate words. No regular "intermittent silences" appear in the spoken versions of the
corresponding ethnic languages (Jelinek, 1997). That phenomenon is a challenge for automatic
speech recognition and it motivated some interest in the shortest admissible grammars as a means
for restoring the boundaries between the words (de Marcken, 1996).

In this article, we present another explanation of Guiraud's law. Our assumptions are: 

1. Stationary distribution P exhibits Hilberg's law (1) for all n.

2. We may assume that V is a set of phonemes only. The word tokens in any text are defined 
as the nonterminal tokens of the shortest admissible grammar. 
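
As an illustration of how the nonterminals of a grammar can serve as word-like units in a text
without spaces, the sketch below builds a small context-free grammar with a greedy
pair-replacement heuristic in the spirit of Re-Pair. This is a stand-in under stated assumptions:
it does not compute the shortest admissible grammar discussed in this article, and the function
names and the toy text are hypothetical.

```python
from collections import Counter

def build_grammar(text):
    """Greedy pair replacement: returns (start, rules).

    start is the initial production (a list of symbols); rules maps each
    nonterminal ("N", i) to the pair of symbols it rewrites to.
    """
    seq = list(text)
    rules = {}
    while len(seq) >= 2:
        pairs = Counter(zip(seq, seq[1:]))
        pair, count = pairs.most_common(1)[0]
        if count < 2:                 # no adjacent pair repeats: stop
            break
        name = ("N", len(rules))      # fresh nonterminal symbol
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):           # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a symbol into the terminal string it derives."""
    if symbol in rules:
        return "".join(expand(s, rules) for s in rules[symbol])
    return symbol

text = "thecatsawthedogandthedogsawthecat"   # toy text without spaces
start, rules = build_grammar(text)
assert "".join(expand(s, rules) for s in start) == text
for name in rules:
    print(name, "->", expand(name, rules))
```

Each nonterminal has a recursive internal structure, since its right-hand side may contain other
nonterminals, which is the property exploited in the discussion that follows.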

We think that the derivation of Guiraud's law based on Hilberg's law is better linguistically 
justified than the classical explanation by Mandelbrot. There are several reasons for that claim: 

1. The new explanation assumes that human narration exhibits strong probabilistic
dependence; it is not an IID distribution. In appendix B we recall that no infinitary
distribution P can be modeled by a stationary hidden Markov chain with a finite number of
hidden states. This fact can have some important implications for computational linguistics.



2. The new explanation does not assume the pre-existence of spaces between the words in 
the natural language production. Children can learn the correct tokenization of speech 
into the words even if they do not know yet what the words are. 

3. Space-to-space words for the IID distributions do not have any definite internal structure. 
It is no longer true for the new explanation. The nonterminals of the shortest grammar 
exhibit the internal structure of recursive rule productions. Such nonterminals have well- 
defined parts. Without any change of the model, we can speak not only of Guiraud's 
and Zipf's laws for the words but we can also discuss laws which relate words to their 
elements. Some example of the latter is Menzerath's law, which states that the longer the
word is the shorter its constituents are (Menzerath, 1928; Altmann, 1980). By means of
the grammar-based codes one can define the structure of word-like objects and investigate 
many quantitative linguistic laws not only for the language production but also for any 
other stationary distributions. 

4. Stationary distribution is called ergodic (roughly) if the relative frequency of any fixed
word does not vary significantly across different texts. By some theorem, every IID
distribution is ergodic (Dębowski, 2005, chapter 4). Nevertheless, empirical studies do not
corroborate Mandelbrot's assumption that language production P is ergodic. The mere
existence of concept "the keywords of the text" reflects the fact that different texts use
different vocabularies systematically. Words, once they appear in some text, tend to
reappear. Let us stress that some significant variation of the word frequencies can be modelled
by non-ergodic stationary distributions. Many non-ergodic stationary distributions are
infinitary (Dębowski, 2005, chapters 4 and 5), see also appendix B. It is an interesting
question whether Hilberg's law (1) implies non-ergodicity of stationary distribution P.
Some further discussion of Hilberg's law and non-ergodic distributions could give us
insight where to seek general quantitative laws in the intertext variability of language. Any
such laws would be of great importance to computational linguistics as well.



6 Conclusions 



In this article, we have discussed some implications of Hilberg's (1990) hypothesis on the
entropy of natural language production. That hypothesis states that finite-order excess entropy
E(n) of the n-letter strings is proportional to the square root of n. So far, the proportionality
has been roughly verified only for n < 50. On the other hand, we have argued that Hilberg's
hypothesis, when extrapolated to n of the text length magnitude, provides a better explanation
of Guiraud's law than the classical explanation based on the existence of "intermittent silences"
(Mandelbrot, 1953).

The new explanation is based on two points. First, we observe that the tokenization of 
a text into orthographic words and their morphemes matches largely the production rules of the 
shortest admissible grammar for the text. Second, we use some partially heuristic, but largely 
deductive, mathematical reasoning to argue that the length of the non-initial production rules of 
the shortest grammar cannot be less than finite-order excess entropy. 

In future research, the rough match of the linguistically motivated tokenizations and
the tokenizations given by the shortest grammars should be surveyed as one of the fundamental
problems of quantitative linguistics. One should survey Zipf's, Guiraud's, and Menzerath's laws
for the nonterminals of the admissible grammars and the orthographic words simultaneously
across a large range of text sizes and languages. Proportionalities (47) and (48) should be
verified as well.

It seems that the existence of a rich formal structure in the natural language production is
reflected by its high total excess entropy E rather than by simply positive entropy gain
$H(1) - h$. We think that the further discussion of Hilberg's hypothesis can improve the quality
of statistical language models both in quantitative and computational linguistics, see appendix B
and our doctoral dissertation (Dębowski, 2005).



Since the shortest admissible grammars reproduce also the internal structure of words, the
behavior of excess entropy might be linked not only with Guiraud's and Zipf's laws but also
with Menzerath's law. The shortest grammars can be used as the definition of words and their
constituents in any symbolic string (Nevill-Manning, 1996). Adopting such a definition,
empirical researchers can survey the form of Guiraud's, Zipf's, and Menzerath's laws also in the
non-linguistic symbolic data (such as DNA). Last but not least, mathematicians can prove some
rigorous theorems.
A Proofs

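Proof of Theorem 2. For any function f we have identity

$$\sum_{k=0}^{m-1} \frac{2 f(2^k n) - f(2^{k+1} n)}{2^{k+1}} = f(n) - n\, \frac{f(2^m n)}{2^m n} \qquad (50)$$

for each finite m. Hence, if (30) is true then we obtain

$$H(n) - hn = \sum_{k=0}^{\infty} \frac{2 H(2^k n) - H(2^{k+1} n)}{2^{k+1}} = \sum_{k=0}^{\infty} \frac{E(2^k n)}{2^{k+1}}, \qquad (51)$$

$$H^C(n) - hn = \sum_{k=0}^{\infty} \frac{2 H^C(2^k n) - H^C(2^{k+1} n)}{2^{k+1}} = \sum_{k=0}^{\infty} \frac{E^C(2^k n)}{2^{k+1}}. \qquad (52)$$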
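
Identity (50) is a telescoping sum, because each summand equals the difference of f(2^k n)/2^k
and f(2^(k+1) n)/2^(k+1). The following minimal sketch checks it numerically for a Hilberg-type
block entropy; the constants h0, h1, h, mu and the function names are hypothetical values chosen
only for illustration.

```python
import math

def telescoped_sum(f, n, m):
    """Left-hand side of identity (50)."""
    return sum((2 * f(2 ** k * n) - f(2 ** (k + 1) * n)) / 2 ** (k + 1)
               for k in range(m))

def closed_form(f, n, m):
    """Right-hand side of identity (50): f(n) - n * f(2^m n) / (2^m n)."""
    return f(n) - f(2 ** m * n) / 2 ** m

# Hypothetical Hilberg-type block entropy with illustrative constants.
h0, h1, h, mu = 1.0, 2.0, 0.5, 0.5

def H(n):
    return h0 + h1 * n ** mu + h * n

n, m = 7, 20
print(telescoped_sum(H, n, m), closed_form(H, n, m))
# As m grows, both sides approach H(n) - h*n = h0 + h1 * n**mu.
print(H(n) - h * n)
```

For this choice of H the second term on the right of (50) tends to h n as m grows, which is the
limit used in passing from (50) to the infinite series.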



Because of inequality (29), we have $H(n) - hn \le H^C(n) - hn$ so

$$\sum_{k=0}^{\infty} \frac{E(2^k n)}{2^{k+1}} \le \sum_{k=0}^{\infty} \frac{E^C(2^k n)}{2^{k+1}}. \qquad (53)$$

If we put $n = 2^p M$ with any p and some fixed M then (53) yields

$$\sum_{k=p}^{\infty} \frac{E^C(2^k M) - E(2^k M)}{2^{k+1}} \ge 0. \qquad (54)$$

Assume that $E^C(2^k M) - E(2^k M) \ge 0$ holds only for finitely many k. Then we would have
$E^C(2^k M) - E(2^k M) < 0$ for all $k \ge p$ and some p. Hence, we would have

$$\sum_{k=p}^{\infty} \frac{E^C(2^k M) - E(2^k M)}{2^{k+1}} < 0. \qquad (55)$$

Since (55) stays in contradiction with (54), our assumption that $E^C(2^k M) - E(2^k M) \ge 0$ only
for finitely many k was false. We must have $E^C(2^k M) - E(2^k M) \ge 0$ for infinitely many k, and
this is exactly inequality (32) which we were to prove. □

Proof of Theorem 3. In order to prove (33), notice that $G = \{b_0 \mapsto v\}$ is a grammar for v.
Its length satisfies $\mathrm{len}\, v = \mathrm{len}\, G \ge \mathrm{len}\, G_{\mathrm{MDL}}(v)$ by (24) and (23).

Now, let us prove (34) and (35). Since vocabulary $G^0_{\mathrm{MDL}}(vu)$ cannot beat vocabularies
$G^0_{\mathrm{MDL}}(v)$ and $G^0_{\mathrm{MDL}}(u)$ in the efficient representation of any strings v
and u respectively, we observe inequalities

$$\mathrm{len}\, G_{\mathrm{MDL}}(v) \le \mathrm{len}\, g_L + \mathrm{len}\, G^0_{\mathrm{MDL}}(vu), \qquad (56)$$

$$\mathrm{len}\, G_{\mathrm{MDL}}(u) \le \mathrm{len}\, g_R + \mathrm{len}\, G^0_{\mathrm{MDL}}(vu), \qquad (57)$$

where $G^0_{\mathrm{MDL}}(vu) \cup \{b_0 \mapsto g_L\}$ and $G^0_{\mathrm{MDL}}(vu) \cup \{b_0 \mapsto g_R\}$
are some grammars for v and u respectively. Analogically,

$$\mathrm{len}\, G_{\mathrm{MDL}}(vu) \le \mathrm{len}\, G_{\mathrm{MDL}}(v) + \mathrm{len}\, G_{\mathrm{MDL}}(u) \qquad (58)$$

since $G^0_{\mathrm{MDL}}(v) \cup G^0_{\mathrm{MDL}}(u) \cup \{b_0 \mapsto g^0_{\mathrm{MDL}}(v)\, g^0_{\mathrm{MDL}}(u)\}$
is a grammar for vu.

Assume that $g_L$ and $g_R$ are obtained by splitting the initial production $g^0_{\mathrm{MDL}}(vu)$
into two parts and recursively expanding the nonterminal at the border if necessary. That is, we
have either $g_L g_R = g^0_{\mathrm{MDL}}(vu)$ or $g_L = y_L x_L$, $g_R = x_R y_R$, and
$g^0_{\mathrm{MDL}}(vu) = y_L b_i y_R$, where nonterminal $b_i$ expands recursively into string
$x_L x_R \in V^+$. Grammar $G_{\mathrm{MDL}}(vu)$ is irreducible so we must have
$\mathrm{len}\, x_L x_R \le L^{>1}(vu)$, where $L^{>1}(vu)$ is the maximal length of a string which
appears in string vu at least twice. Thus,

$$\left| \mathrm{len}\, g_L + \mathrm{len}\, g_R - \mathrm{len}\, g^0_{\mathrm{MDL}}(vu) \right| \le L^{>1}(vu). \qquad (59)$$

By (59), adding (56) and (57) yields

$$\mathrm{len}\, G_{\mathrm{MDL}}(v) + \mathrm{len}\, G_{\mathrm{MDL}}(u) \le \mathrm{len}\, g^0_{\mathrm{MDL}}(vu) + 2\, \mathrm{len}\, G^0_{\mathrm{MDL}}(vu) + L^{>1}(vu)$$

$$= \mathrm{len}\, G_{\mathrm{MDL}}(vu) + \mathrm{len}\, G^0_{\mathrm{MDL}}(vu) + L^{>1}(vu). \qquad (60)$$

In fact, we can rewrite (60) and (58) as (35). By (59), we also have
$\mathrm{len}\, g_L, \mathrm{len}\, g_R \le \mathrm{len}\, g^0_{\mathrm{MDL}}(vu) + L^{>1}(vu)$.
Inserting these two inequalities into (56) and (57) respectively yields (34). □

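Proof of Theorem 4. According to the Conjecture we have

$$\mathrm{len}\, \mathrm{MDL}(v) + \mathrm{len}\, \mathrm{MDL}(u) - \mathrm{len}\, \mathrm{MDL}(vu)$$

$$\le \gamma(L^m(v)) + \gamma(L^m(u)) - \gamma(L^m(vu)) + f_2(L^m(v)) + f_2(L^m(u)) + f_2(L^m(vu)). \qquad (61)$$

By $0 \le f_i(n+1) - f_i(n) \le c_i / n$ and (34), there is

$$f_i(n) \le f_i(1) + \sum_{k=2}^{n} c_i / k \le f_i(1) + c_i \log n, \qquad (62)$$

$$f_i(L^m(v)) \le f_i(L^m(vu)) + c_i\, \frac{L^{>1}(vu)}{L^m(vu)}. \qquad (63)$$

Hence by (33),

$$\gamma(L^m(v)) + \gamma(L^m(u)) - \gamma(L^m(vu))$$

$$\le \left[ L^m(v) + L^m(u) - L^m(vu) \right] f_1(L^m(vu)) + c_1 \left[ L^m(v) + L^m(u) \right] \frac{L^{>1}(vu)}{L^m(vu)}$$

$$= \left[ L^m(v) + L^m(u) - L^m(vu) \right] \left[ f_1(L^m(vu)) + c_1\, \frac{L^{>1}(vu)}{L^m(vu)} \right] + c_1 L^{>1}(vu)$$

$$\le \left[ L^m(v) + L^m(u) - L^m(vu) \right] \left[ f_1(\mathrm{len}\, vu) + c_1\, \frac{L^{>1}(vu)}{L^m(vu)} \right] + c_1 L^{>1}(vu). \qquad (64)$$

On the other hand,

$$f_2(L^m(v)) + f_2(L^m(u)) + f_2(L^m(vu)) \le 3 f_2(L^m(vu)) + 2 c_2\, \frac{L^{>1}(vu)}{L^m(vu)}$$

$$\le 3 \left[ f_2(\mathrm{len}\, vu) + c_2\, \frac{L^{>1}(vu)}{L^m(vu)} \right]. \qquad (65)$$

Inserting (64), (65), and (62) into (61) we obtain (37). □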
B Some properties of infinitary distributions



Infinitary distributions seem to be a new interesting class of the stochastic models for human
narration. The mathematics of excess entropy is just being developed, cf. Dębowski (2005) for
an overview. Our program is to bring together some advanced results of mathematics
(measure-theoretic probability theory, coding theory) and some quantitative linguistic intuitions. We can
give a linguistic interpretation to some mathematical theorems and a formal language to express 
some vague hypotheses about the obscure nature of probabilistic language models. 

We would like to mention four facts about infinitary distributions which can be important 
for quantitative and computational linguistics in the view of Hilberg's hypothesis. These are: 

1. There are infinitary distributions which are not deterministic stationary distributions. That
is, total excess entropy $E = \infty$ does not imply entropy rate h = 0.

2. All stationary distributions which consist in a random description of some infinite random
object must be infinitary and nonergodic (Dębowski, 2005, chapter 5).

Hence, we may suppose that $E = \infty$ holds for the stationary distribution of the language
production because almost every human text refers systematically to a different and
potentially infinite fictitious world.

3. For some infinitary distributions, value P(v) can be computed for every string v by some
finite procedure, cf. Berthé (1994) and Gramss (1994).



4. No infinitary distribution can be represented by a finite-state hidden Markov model (HMM),
cf. Crutchfield and Feldman; Upper (1997); Cover and Thomas (1991, section 2.8,
data processing inequality).

In spite of their inadequacy as the models of infinitary distributions, finite-state HMMs are
the standard heuristic models of natural language engineering. It happens so only for the
necessity of the effective search for the most probable hidden states. Some well-known
applications of HMMs are automatic speech recognizers (Jelinek, 1997) and trigram
part-of-speech taggers (Manning and Schütze, 1999; Dębowski, 2004b). It was observed that
the error rate of trigram taggers decreases as a negative power of the size of the training
data. When we increase the training data size ten times, the error rate diminishes only by
half (Megyesi, 2001). In fact, such power-law decay of the error rate can be also some
consequence of Hilberg's law (Bialek et al., 2001).
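
The tagger observation fixes the exponent of the power law. The short sketch below assumes the
error has the form a · N^(-alpha), where N is the training data size; this functional form and
the names used are illustrative assumptions, and only the halving per tenfold increase comes
from the text.

```python
import math

# Assumed form: error(N) = a * N**(-alpha). A tenfold increase of N halves the
# error, so 10**(-alpha) = 1/2 and alpha = log10(2).
alpha = math.log10(2)

def error_ratio(growth_factor, exponent=alpha):
    """Factor by which the error shrinks when training data grows by growth_factor."""
    return growth_factor ** (-exponent)

print(round(alpha, 3), error_ratio(10), error_ratio(100))
```

Under this assumption the exponent is about 0.3, so a hundredfold increase of the training data
would reduce the error only to about a quarter.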



Lack of space prevents us from explaining exactly the terminology and the reasons for the
mathematical facts mentioned above. We will try to popularize some ideas of our thesis among the
linguistic audience in the next articles.